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Abstract. Algorithms for frequent pattern mining, a popular informatics 
application, have unique requirements that are not met by any of the exist- 
ing parallel tools. In particular, such applications operate on extremely large 
data sets and have irregular memory access patterns. For efficient paral- 
lelization of such applications, it is necessary to support dynamic load bal- 
ancing along with scheduling mechanisms that allow users to exploit data 
locality. Given these requirements, task parallelism is the most promising 
of the available parallel programming models. However, existing solutions 
for task parallelism schedule tasks implicitly and hence, custom schedul- 
ing policies that can exploit data locality cannot be easily employed. In this 
paper we demonstrate and characterize the speedup obtained in a frequent 
pattern mining application using a custom clustered scheduling policy in 
place of the popular Cilk-style policy. We present PFunc, a novel task par- 
allel library whose customizable task scheduling and task priorities facili- 
tated the implementation of our clustered scheduling policy. 
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Introduction 



Algorithms for frequent pattern mining (FPM), a popular informatics applica- 
tion, exhibit certain distinct characteristics that distinguish them from traditional 
high-performance computing applications. Typically, FPM applications operate 
on large, irregular and dynamic data sets. The data (and corresponding com- 
putations) cannot be partitioned a priori, making these applications extremely 
sensitive to load balancing and scheduling. Memory access patterns are often 
data-dependent, requiring one data object's location in memory to be resolved 
before the next can be fetched. Finally, safe parallelization of FPM applications 
requires fine-grained synchronization. For all of these reasons, popular parallel 
programming models such as the data parallel and the single process multiple 
data (SPMD) models are not well-suited to FPM applications. Of the variety of 
alternative parallel programming models available, task parallelism is the most 
promising when it comes to meeting the challenges posed by FPM applications. 
The task parallel programming model is sufficiently high-level and general pur- 



pose to be able to parallelize both regular and irregular applications. However, 
existing solutions for task parallelism have shortcomings that prevent efficient 
parallelization of FPM applications. Specifically, FPM applications require cus- 
tom task scheduling policies that can exploit data locality between tasks that are 
not always related by the parent-child relationship. In most existing solutions 
for task parallelism, not only are tasks scheduled transparently from the users, 
but also, data locality between tasks is exploited only if there is a parent-child 
relationship between those two tasks. 

In this paper, we demonstrate the speedup obtained in a task parallel 
Apriori-based FPM implementation by switching from the popular Cilk-style (8j[ 
task scheduling policy to a customized "clustered" task scheduling policy. Fur- 
thermore, we collect various hardware metrics to characterize the factors that 
resulted in the speedup. To implement these two (Cilk-style and clustered) 
scheduling policies, we use PFunc, a novel library-based implementation of task 
parallelism, that allows users to customize parameters such as task scheduling 
policy and task priority. Unlike most other parallelization tools, PFunc provides 
a natural interface that enables facile implementation of our clustered schedul- 
ing policy. Through this study, we highlight the need to incorporate support for 
efficient parallelization of FPM applications in the task parallel model. 

1. Background 

Murphy and Kogge |23 [ demonstrated the differences in memory access patterns 
of informatics applications when compared to those of traditional scientific com- 
puting applications. Berry et al. [7| and Lumsdaine et al. 1221 have shown that 
current techniques applied to high performance computing are inadequate for 
informatics applications. Since FPM was introduced as a relevant problem in in- 
formatics, its memory characteristics and parallelization have been extensively 
researched Effl|9l[T6l[Tl[T9l|21J|2l|26l|29). Zaki and Parathasarathy |3T) were the 
first to explore the idea of clustering to aid in fast discovery of frequent patterns. 

Many solutions have been implemented to facilitate dynamic task paral- 
lelism. Fortran M [13), Cilk HU and OpenMP 3.0 |25) implement task paral- 
lelism as extensions to stock programming languages. Other solutions such as 
Intel's Threading Building Blocks (TBB) I2ZI, Microsoft's Parallel Patterns Li- 
brary (PPL) and Task Parallel Library (TPL), and Java Concurrency Utilities are 
library-based. All of the three HPCS languages (Chapel [10[, Fortress [6[ and 
X10 lU) offer task parallelism as language features. Scheduling of tasks has re- 
ceived wide attention in the programming community. Cilk's depth-first work [8j 
model, the X10 Work Stealing framework's (XWS) breadth-first fT2l model and 
Guo et al.'s hybrid model flTI are notable examples of work stealing schedulers. 
Each of the three above mentioned scheduling policies exploit data locality un- 
der different circumstances. For example, Cilk's scheduling policy exploits data 
locality only when the applications are deeply nested. All of the task parallel 
solutions discussed above schedule tasks implicitly, thereby disallowing cus- 
tomization of task scheduling policies. In this paper, we demonstrate the im- 
portance of customizing the task scheduling policy in parallelization tools when 



parallelizing FPM applications. We also describe the features of PFunc that en- 
able FPM applications to employ custom scheduling strategies that help them 
outperform other implementations that employ default scheduling policies pro- 
vided by existing solutions for task parallelism. 

2. Problem Description 

In this section, we describe FPM and comment on important aspects of its im- 
plementation that influence performance. Briefly the problem description is as 
follows: Let I = i 2 , •■, i n } be a set of n items, and let D = {Ti, T 2 , .., T m } be 
a set of m transactions, where each transaction TJ is a subset of /. An itemset 
i C I of size k is known as a fc-itemset. The support (that is, frequency) of i is 
S 5^=i ( 1 : i Q Tj), or informally speaking, the number of transactions in D that 
have i as a subset. The FPM problem is to find all i 6 D that have support greater 
than a user supplied minimum value. For our FPM implementation, we choose 
the Apriori [i5| algorithm. Due to its efficiency, robustness and guaranteed main 
memory footprint, the Apriori algorithm is widely used in FPM implementa- 
tions including those in commercial products such as IBM's InfoSphere Ware- 
house 1501 . Apriori traverses the itemset search space in breadth-first order. Its 
efficiency stems from its use of the anti-monotone property: If a size A:-itemset 
is not frequent, then any size (Tc+Ij-itemset containing it will not be frequent. 
The algorithm first finds all frequent 2 -items in the data set, and then iteratively 
finds all frequent fc-itemsets using the frequent (k-1 j-itemsets discovered previ- 
ously. For example, let A, B, C and D be individual items (2-itemsets) that are 
frequent in a transaction database. Then, for stage 2, AB, AC, AD, BC, BD, and 
CD are the candidate 2-itemsets. If at stage 2, after counting, 2-itemsets AB, AC 
and AD were found to be frequent, then ABC, ABD, and ACD are the candidates 
for stage 3. The frequency of a candidate A:-itemset is counted by performing a 
join ( N ) operation on the transaction-ID lists of each individual item in that par- 
ticular itemset. In our task parallel implementation of the Apriori algorithm, the 
counting operations required for each fc-itemset are executed as a separate task. 

Requirements for efficient parallelization: The Apriori algorithm is highly depen- 
dent on memory reuse for its performance f!6l . As tasks mine for itemsets, 
they access overlapping memory regions as many itemsets share transaction-ID 
lists. For example, if we were mining for 2-itemsets AB, AC, and AD, then the 
transaction-ID list of A would be common in the mining operations. The greater 
the overlap of items in the itemsets, the greater the potential for memory reuse 
between their respective tasks. Exploiting such inter-task data localities through 
locality-aware scheduling of tasks is the key for efficient parallel execution of 
Apriori-based FPM applications. 

Shortcomings of current solutions: Existing task parallel solutions schedule tasks 
implicitly using different flavors of the Cilk-style work stealing (8j task sched- 
uler and hence, do not support custom task scheduling policies that can ex- 
ploit data localities between user-specified tasks. Furthermore, in the Cilk-style 
scheduling policy, there is little or no contention when a thread executes tasks 



that are on its own task queue, but stealing a task from another thread's task 
queue is an expensive operation. Cilk-style work stealing benefits applications 
that are deeply nested (that is, recursive) in nature. Nested tasks, by definition, 
spawn other tasks. Therefore, when a thread steals a nested task, it implicitly 
gains all the tasks that will be generated by the stolen task-thereby minimiz- 
ing task stealing. However, in the Apriori algorithm, tasks are non-nested as we 
traverse the search space in breadth-first order. Hence, once a thread runs out 
of work, it must steal tasks repeatedly from other victim threads' task queues 
in order to keep itself busy. Such work stealing is detrimental to Apriori-based 
FPM implementations' performance both because of the increased contention on 
victim threads' task queues and the lack of data locality among stolen tasks. 

3. PFunc: A new tool for task parallelism 

PFunc |20ll is a lightweight and portable task parallel library for C and C++ users, 
which has been designed using the generic programming paradigm fl5ll to over- 
come some of the shortcomings of existing solutions for task parallelism. Due to 
space constraints, we describe only those features of PFunc that affect Apriori- 
based FPM implementations. PFunc allows users to choose the task scheduling 
policy at compile time. Users can either choose from built-in scheduling policies 
such as Cilk-style, first-in first-out (FIFO), last-in first-out (LIFO) and priority- 
based, or they can supply their own scheduling policy. All scheduling policies 
are "models" of the scheduler "concept". This generic design enforces a uni- 
form interface across the different scheduling policies and enables compile time 
plug-and-play capabilities with no runtime penalty. By default, PFunc follows 
the work stealing model in which each thread has its own task queue and tasks 
are enqueued on the task queue of the thread that spawned the task. Users have 
the option to override this default setting at runtime and place tasks onto a par- 
ticular thread's task queue (that is, change the task's affinity). PFunc also pro- 
vides task attributes that can be attached to each individual task when spawning 
the task. Task attributes, such as task priority, are an important tool in the imple- 
mentation of many scheduling policies. For example, when using priority-based 
scheduling, users specify a task's priority using the task attribute mechanism. 
Like the scheduler, task attributes can be customized to suit the task scheduling 
policy. To enable hardware profiling, PFunc is fully integrated with the Perfor- 
mance Application Programming Interface (PAPI) [ 1 1 . 

4. Clustered task scheduling 

To efficiently parallelize our Apriori-based FPM implementation using the task 
parallel programming model, we have designed a clustered scheduling policy. 
In this policy, tasks are clustered together such that there is a greater likelihood 
of memory reuse between the clustered tasks (for example, tasks mining for 3- 
itemsets ABC and ABD) as they are executed by the same thread. Furthermore, 
we employ a custom task stealing policy in which clusters of tasks are stolen 



instead of a single task. Such clustered stealing not only reduces contention on 
the thread-local task queues by minimizing the number of required steals, but 
also maintains data locality between the stolen tasks. 

Implementation: In our FPM implementation, each task performs the mining op- 
erations required for one A:-itemset. Each fc-itemset is stored as a sorted set of in- 
dividual items. To cluster tasks that have significant memory overlap, we use a 
common prefix of length (k-1). For example, the 3-itemsets ABC and ABD share 
the 2-prefix AB and hence, their respective tasks are clustered together. For effi- 
cient execution of our FPM application, it is necessary to ensure that such clus- 
ters of tasks have a high likelihood of being executed by the same thread. To 
this end, we use a hash table (std :: hash_map) as the task queue for each thread. 
This is a departure from the Cilk model [14j, in which each thread's task queue 
is a deque. With the hash table structure in place, clustering of tasks is achieved 
by placing all fc-itemsets that have a common (k-1) prefix into the same bucket. 
To achieve such clustering, we used the following hash function to compute the 
hash of a fc-itemset. First, the hash of each of its first (k-1) items is computed using 
C++ Standard Library's |28l std::hash function object. Then, these individual 
hashes are XOR'ed together to produce one final hash. For example, the hash for 
the 3-itemset ABC is computed by XOR-ing the results of applying std "hash to 
A and B separately The hashes thus computed for the 3-itemsets ABC and ABD 
are equal and hence, their respective tasks are placed in the same bucket. Each 
thread executes the tasks placed in its task queue (that is, hash table) by iterat- 
ing through its task queue's buckets starting from the first non-empty bucket. 
When a thread runs out of tasks on its own task queue, it randomly selects a 
victim thread to steal tasks. Then, it steals the first non-empty bucket from the 
victim thread's task queue. This stealing policy is better equipped than the Cilk- 
style stealing policy to avoid repeated stealing in our FPM implementation as 
potentially, more than one task can be stolen during each steal. Furthermore, by 
stealing an entire bucket of tasks, data locality between stolen tasks is preserved. 

Integrating with PFunc: Realizing our customized clustered task scheduling pol- 
icy in PFunc is achieved in two steps. First, the entire scheduling policy is im- 
plemented to model PFunc's scheduler concept. This enables us to pick our clus- 
tered scheduling policy as the task scheduling policy of choice at compile time. 
Second, we customize PFunc's task attributes (again, at compile time) such that 
we can attach a reference to the /c-itemset that needs to be mined as the respective 
task's priority. When a task is spawned, our hash function uses its priority (that 
is, the associated fc-itemset's reference) to determine the appropriate bucket (that 
is, cluster) for this task in the spawning thread's task queue. 

5. Results 

In this section, we present the results of running our Apriori-based FPM im- 
plementation using the Cilk-style and the clustered scheduling policies. PFunc 
was used for parallelization of our FPM implementation and the scheduling pol- 
icy for a particular run was chosen at compile time. Minimal code modifica- 
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Figure 1. Graph showing the normalized runtimes of different datasets when using the Cilk-style 
and Clustered task scheduling policies in our Apriori-based FPM implementation with 8 threads. 
Support (that is, frequency) for each of the datasets is given in Table[l] 



Dataset 


Support 


IPC 


DTLB L1M/L2H 


DTLB L1M/L2M 


Cilk 


Cluster 


Cilk 


Cluster 


Cilk 


Cluster 


accidents 


0.25 


0.595689 


0.603959 


0.000048 


0.000046 


0.000161 


0.000110 


chess 


0.6 


0.560538 


0.668965 


0.000797 


0.000242 


0.001006 


0.000032 


connect 


0.8 


0.543099 


0.809308 


0.000249 


0.000112 


0.001204 


0.000141 


kosarak 


0.0013 


0.692103 


0.717599 


0.000400 


0.000185 


0.000659 


0.000123 


pumsb 


0.75 


0.494539 


0.719072 


0.000230 


0.000114 


0.001276 


0.000126 


pumsb_star 


0.3 


0.527659 


0.698358 


0.000315 


0.000145 


0.001082 


0.000113 


mushroom 


0.10 


0.570390 


0.705003 


0.000477 


0.000267 


0.000950 


0.000022 


T40I10D100K 


0.005 


0.627272 


0.727288 


0.000368 


0.000305 


0.000900 


0.000021 


T10I4D100K 


0.00006 


0.555330 


0.716282 


0.000218 


0.000144 


0.000876 


0.000044 



Table 1. Instructions-per-cycle (IPC), LI data TLB misses (L1M/L2H) and L2 data TLB misses 
(L1M/L2M) when using the Cilk-style and Clustered task scheduling policies in our Apriori-based 
FPM implementation with 8 threads for different datasets from the FIMI repository lo- 



tion was required to attach the fc-itemset reference as task priority when spawn- 
ing the tasks for the clustered scheduling policy. We ran our experiments on a 
four socket, quad-core (total 16 cores) AMD 8356 processor running Linux Ker- 
nel 2.6.24. For compilation, we used GCC 4.3.2 with: "-03 -fomit-frame-pointer 
-funroll-loops". To collect hardware metrics, we used PFunc's integration with 
PAPI. The benchmarks were run on data sets from the FIMI repository (J)- 

Figured] depicts the runtimes of our FPM implementation for both the Cilk- 
style and the clustered task scheduling policies. Runtimes were recorded with 
hardware profiling turned off and were averaged after 5 runs. The clustered 
scheduling policy runs significantly faster (more than 50%) for most of the data 
sets, with the accidents.dat data set being the only exception. To test our hy- 
pothesis that our clustered scheduling policy exploits data locality better than 
the Cilk-style policy, we collected various hardware metrics. Table[T]summarizes 
some of the important metrics. Our clustered scheduling policy delivers more 



instructions per clock cycle than the Cilk-style scheduling policy for all data 
sets. Also, our clustered scheduling policy incurs far fewer L2 data TLB misses 
than the Cilk-style scheduling policy. This is because the benefits of clustering 
tasks that have a significant memory overlap outweigh the additional opera- 
tional costs incurred due to using hash tables in our clustered scheduling policy. 
When results from Figured] and Table Q] are taken together, we can conclude that 
the clustered scheduling policy exploits data locality better than the Cilk-style 
scheduling policy for our Apriori-based FPM implementation. 

6. Conclusion and Future Work 

We have demonstrated that our custom clustered scheduling policy performs 
better than the popular Cilk-style scheduling policy for an Apriori-based FPM 
implementation. This was made possible by PFunc, which provides better sup- 
port for parallelization of FPM applications than existing solutions for task par- 
allelism by allowing facile customization of task scheduling policy and task 
attributes. An interesting topic for future research is to implement a dynamic 
task scheduling policy that utilizes a multi-dimensional index structure as task 
queues. This would enable a thread to dynamically pick the "nearest-neighbor" 
of the previously executed task as its next task to execute. 
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